Abstract


A general purpose multiprocessor should be scalable, i.e., show higher performance when more
hardware resources are added to the machine. Architects of such multiprocessors must address the loss in
processor efficiency due to two fundamental issues: long memory latencies and waits due to
synchronization events. It is argued that a well designed processor can overcome these losses provided
there is sufficient parallelism in the program being executed. The detrimental effect of long latency can
be reduced by instruction pipelining; however, the restriction of a single thread of computation in von
Neumann processors severely limits their ability to have more than a few instructions in the pipeline.
Furthermore, techniques to reduce the memory latency tend to increase the cost of task switching. The
cost of synchronization events in von Neumann machines makes decomposing a program into very small
tasks counter-productive. Dataflow machines, on the other hand, treat each instruction as a task, and by
paying a small synchronization cost for each instruction executed, offer the ultimate flexibility in
scheduling instructions to reduce processor idle time.

Key words and phrases: caches, cache coherence, dataflow architectures, hazard resolution, instruction
pipelining, LOAD/STORE architectures, memory latency, multiprocessors, multi-thread architectures,
semaphores, synchronization, von Neumann architecture.
Two Fundamental Issues in Multiprocessing


1. Importance of Processor Architecture

Parallel machines having up to several dozen processors are commercially available now. Most of the
designs are based on von Neumann processors operating out of a shared memory. The differences in the
architectures of these machines, in terms of processor speed, memory organization and communication
systems, are significant, but they all use relatively conventional von Neumann processors. These
machines represent the general belief that processor architecture is of little importance in designing
parallel machines. We will show the fallacy of this assumption on the basis of two issues: memory
latency and synchronization. Our argument is based on the following observations:

1. Most von Neumann processors are likely to "idle" during long memory references, and such
references are unavoidable in parallel machines.

2. Waits for synchronization events often require task switching, which is expensive on von
Neumann machines. Therefore, only certain types of parallelism can be exploited
efficiently.

We believe the effect of these issues on performance to be fundamental and, to a large degree,
orthogonal to the effect of circuit technology. We will argue that by designing the processor properly, the
detrimental effect of memory latency on performance can be reduced provided there is parallelism in the
program. However, techniques for reducing the effect of latency tend to increase the synchronization
cost.

In the rest of this section, we articulate our assumptions regarding general purpose parallel computers.
We then discuss the often neglected issue of quantifying the amount of parallelism in programs.
Section 2 develops a framework for defining the issues of latency and synchronization. Section 3
examines the methods to reduce the effect of memory latency in von Neumann computers and discusses
their limitations. Section 4 similarly examines synchronization methods and their cost. In Section 5, we
discuss multi-threaded computers like HEP and the MIT Tagged Token Dataflow machine, and show
how these machines can tolerate latency and synchronization costs provided there is sufficient parallelism
in programs. The last section summarizes our conclusions.

1.1. Scalable Multiprocessors

We are primarily interested in general purpose parallel computers, i.e., computers that can exploit
parallelism, when present, in any program. Further, we want multiprocessors to be scalable in such a
manner that adding hardware resources results in higher performance without requiring changes in
application programs. The focus of the paper is not on arbitrarily large machines, but machines which
range in size from ten to a thousand processors. We expect the processors to be at least as powerful as the
current microprocessors and possibly as powerful as the CPU's of the current supercomputers. In
particular, the context of the discussion is not machines with millions of one bit ALU's, dozens of which
may fit on one chip. The design of such machines will certainly involve fundamental issues in addition to
those presented here. Most parallel machines that are available today or likely to be available in the next
few years fall within the scope of this paper (e.g., the BBN Butterfly [36], ALICE [13] and now
FLAGSHIP, the Cosmic Cube [38] and Intel's iPSC, IBM's RP3 12^1, Alliant and CEDAR [26], and
GRIP [11]).

If the programming model of a parallel machine reflects the machine configuration, e.g., number of
processors and interconnection topology, the machine is not scalable in a practical sense. Changing the
machine configuration should not require changes in application programs or system software; updating
tables in the resource management system to reflect the new configuration should be sufficient. However,
few multiprocessor designs have taken this stance with regard to scaling. In fact, it is not uncommon to
find that source code (and in some cases, algorithms) must be modified in order to run on an altered
machine configuration. Figure 1 depicts the range of effects of scaling on the software. Obviously, we
consider architectures that support the scenario at the right hand end of the scale to be far more desirable
than those at the left. It should be noted that if a parallel machine is not scalable, then it will probably not
be fault-tolerant; one failed processor would make the whole machine unusable. It is easy to design
hardware in which failed components, e.g., processors, may be masked out. However, if the application
code must be rewritten, our guess is that most users would wait for the original machine configuration to
be restored.

1.2.  Quantifying  Parallelism  in  Programs 

Ideally, a parallel machine should speed up the execution of a program in proportion to the number of
processors in the machine. Suppose t(n) is the time to execute a program on an n-processor machine.
The speed-up as a function of n may be defined as follows:

$$\text{speed-up}(n) = \frac{t(1)}{t(n)}$$

Speed-up is clearly dependent upon the program or programs chosen for the measurement. Naturally, if
a program does not have "sufficient" parallelism, no parallel machine can be expected to demonstrate
dramatic speed-up. Thus, in order to evaluate a parallel machine properly, we need to characterize the
inherent or potential parallelism of a program. This presents a difficult problem because the amount of
parallelism in the source program that is exposed to the architecture may depend upon the quality of the
compiler or programmer annotations. Furthermore, there is no reason to assume that the source program
cannot be changed. Undoubtedly, different algorithms for a problem have different amounts of
parallelism, and the parallelism of an algorithm can be obscured in coding. The problem is compounded
by the fact that most programming languages do not have enough expressive power to show all the
possible parallelism of an algorithm in a program. In spite of all these difficulties, we think it is possible
to make some useful estimates of the potential parallelism of an algorithm.

Figure 2: Parallelism Profile of SIMPLE on a 20 x 20 Array

It is possible for us to code algorithms in Id [30], a high-level dataflow language, and compile Id
programs into dataflow graphs, where the nodes of the graph represent simple operations such as fixed
and floating point arithmetic, logicals, equality tests, and memory loads and stores, and where the edges
represent only the essential data dependencies between the operations. A graph thus generated can be
executed on an interpreter (known as GITA) to produce results and the parallelism profile, pp(τ), i.e., the
number of concurrently executable operators as a function of time on an idealized machine. The idealized
machine has unbounded processors and memories, and instantaneous communication. It is further
assumed that all operators (instructions) take unit time, and operators are executed as soon as possible.
The parallelism profile of a program gives a good estimate of its "inherent parallelism" because it is
drawn assuming the execution of two operators is sequentialized if and only if there is a data dependency
between them. Figure 2 shows the parallelism profile of the SIMPLE code for a representative set of
input data. SIMPLE [12], a hydrodynamics and heat flow code kernel, has been extensively studied both
analytically [1] and by experimentation.
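
As a rough illustration of how such a profile arises, the following Python sketch computes a parallelism profile for a small dependency graph under the idealized assumptions stated above (unit-time operators, each fired as soon as its inputs exist). It is not the GITA interpreter; the function name `parallelism_profile`, the dictionary encoding of the graph, and the example expression are all hypothetical.

```python
from collections import defaultdict

def parallelism_profile(deps):
    """Return pp as a list: entry k is the number of operators fired at step k+1.

    deps maps each operator name to the operators whose results it consumes
    (an acyclic graph). Every operator takes unit time and fires as soon as
    all of its inputs are available. Recursion is fine for small graphs.
    """
    step = {}  # operator -> time step at which it fires (1-based)

    def fire_time(op):
        if op not in step:
            step[op] = 1 + max((fire_time(d) for d in deps[op]), default=0)
        return step[op]

    counts = defaultdict(int)
    for op in deps:
        counts[fire_time(op)] += 1
    return [counts[t] for t in range(1, max(counts) + 1)] if counts else []

# Hypothetical graph for (a + b) * (c - d) + e, with five memory loads.
graph = {
    "ld_a": [], "ld_b": [], "ld_c": [], "ld_d": [], "ld_e": [],
    "add": ["ld_a", "ld_b"],
    "sub": ["ld_c", "ld_d"],
    "mul": ["add", "sub"],
    "sum": ["mul", "ld_e"],
}
print(parallelism_profile(graph))  # [5, 2, 1, 1]
```

Even this tiny graph shows the shape discussed in the text: parallelism is high at first and then falls off along the critical path.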

The solid curve in Figure 2 represents a single outer-loop iteration of SIMPLE on a 20 x 20 mesh, while
a typical simulation run performs 100,000 iterations on a 100 x 100 mesh. Since there is no significant
parallelism between the outer-loop iterations of SIMPLE, the parallelism profile for N iterations can be
obtained by repeating the profile in the figure N times. Approximately 75% of the instructions executed
involve the usual arithmetic, logical and memory operators; the rest are miscellaneous overhead
operators, some of them peculiar to dataflow. One can easily deduce the parallelism profile of any set of
operators from the raw data that was used to generate the profile in the figure; however, classifying
operators as overhead is not easy in all cases.

The reader may visualize the execution on n processors by drawing a horizontal line at n on the
parallelism profile and then "pushing" all the instructions which are above the line to the right and below
the line. The dashed curve in Figure 2 shows this for SIMPLE on 1000 processors and was generated by
our dataflow graph interpreter by executing the program again with the constraint that no more than n
operations were to be performed at any step. However, a good estimate for t(n) can be made, very
inexpensively, from the ideal parallelism profile as follows. For any τ, if pp(τ) ≤ n, we perform all pp(τ)
operations in time step τ. However, if pp(τ) > n, then we assume it will take the least integer greater than
or equal to pp(τ)/n steps to perform pp(τ) operations. Hence,

$$t(n) = \sum_{\tau=1}^{T} \left\lceil \frac{pp(\tau)}{n} \right\rceil$$

where T is the number of steps in the ideal parallelism profile. Our estimate of t(n) is conservative
because the data dependencies in the program may permit the execution of some instructions from
pp(τ+1) in the last time step in which instructions from pp(τ) are executed.
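
The estimate can be sketched in a few lines of Python. The profile below is a made-up example, not data from SIMPLE, and the function names are hypothetical. Speed-up follows the definition given earlier; the utilization function assumes the common definition of speed-up divided by the number of processors, which is an assumption of this sketch.

```python
def estimated_time(pp, n):
    """Estimate t(n): each ideal step with pp[k] operations needs
    ceil(pp[k] / n) steps on n processors."""
    return sum(-(-ops // n) for ops in pp)  # -(-a // b) is an integer ceiling

def speed_up(pp, n):
    """speed-up(n) = t(1) / t(n); t(1) is the total operation count."""
    return estimated_time(pp, 1) / estimated_time(pp, n)

def utilization(pp, n):
    """Assumed definition: useful work divided by the n * t(n) processor-steps."""
    return estimated_time(pp, 1) / (n * estimated_time(pp, n))

profile = [5, 2, 1, 1]  # hypothetical ideal parallelism profile
for n in (1, 2, 4):
    print(n, estimated_time(profile, n), speed_up(profile, n), utilization(profile, n))
```

For this profile the estimate gives t(1) = 9, t(2) = 6 and t(4) = 5, so doubling the processors from 2 to 4 buys little because most steps have fewer than two ready operations.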

In our dataflow graphs the number of instructions executed does not change when the program is
executed on a different number of processors. Hence, t(1) is simply the area under the parallelism profile.

We can now plot speed-up(n) = t(1)/t(n) and utilization(n) for SIMPLE, as shown in Figure 3 (for
example, for the case of 240 processors). One way to interpret utilization(n) is that a program has n
parallel operations for only a utilization(n) fraction of its total duration.


It can be argued that this problem does not have enough parallelism to keep, say, 1000 processors
utilized. On the other hand, if we cannot keep 10 processors fully utilized, we cannot blame the lack of
parallelism in the program. Generally, under-utilization of the machine in the presence of massive
parallelism stems from aspects of the internal architecture of the processors which preclude exploitation
of certain types of parallelism. Machines are seldom designed to exploit inner-loop, outer-loop, as well as
instruction-level parallelism simultaneously.

It is noteworthy that the potential parallelism varies tremendously during execution, a behavior which in
our experience is typical of even the most highly parallel programs. We believe that any large program
that runs for a long time must have sufficient parallelism to keep hundreds of processors utilized; several
applications that we have studied support this belief. However, a parallel machine has to be fairly general
purpose and programmable for the user to be able to express even the class of partial differential
equation-based simulation programs represented by SIMPLE.

2. Latency and Synchronization

We now discuss the issues of latency and synchronization. We believe latency is most strongly a
function of the physical decomposition of a multiprocessor, while synchronization is most strongly a
function of how programs are logically decomposed.

2.1. Latency: The First Fundamental Issue

Any multiprocessor organization can be thought of as an interconnection of the following three types of
modules (see Figure 4):

1. Processing elements (PE): Modules which perform arithmetic and logical operations on
data. Each processing element has a single communication port through which all data
values are received. Processing elements interact with other processing elements by
sending messages, issuing interrupts or sending and receiving synchronizing signals through
shared memory. PE's interact with memory elements by issuing LOAD and STORE
instructions modified as necessary with atomicity constraints. Processing elements are
characterized by the rate at which they can process instructions. As mentioned, we assume
the instructions are simple, e.g., fixed and floating point scalar arithmetic. More complex
instructions can be counted as multiple instructions for measuring instruction rate.

2. Memory elements (M): Modules which store data. Each memory element has a single
communication port. Memory elements respond to requests issued by the processing
elements by returning data through the communication port, and are characterized by their
capacity and the rate at which they respond to these requests.

3. Communication elements (C): Modules which transport data. Each nontrivial
communication element has at least three communication ports. Communication elements
neither originate nor receive synchronizing signals, instructions, or data; rather, they
retransmit such information when received on one of the communication ports to one or
more of the other communication ports. Communication elements are characterized by the
rate of transmission, the time taken per transmission, and the constraints imposed by one
transmission on others, e.g., blocking. The maximum amount of data that may be conveyed
on a communication port per unit time is fixed.

Latency is the time which elapses between making a request and receiving the associated response. The
above model implies that a PE in a multiprocessor system incurs a larger latency in memory references than
in a uniprocessor system because of the transit time in the communication network between PE's and the
memories. The actual interconnection of modules may differ greatly from machine to machine. For
example, in the BBN Butterfly machine all memory elements are at an equal distance from all processors,
while in IBM's RP3, each processor is closely coupled with a memory element. However, we assume
that the average latency in a well designed n-PE machine should be O(log(n)). In a von Neumann
processor, memory latency determines the time to execute memory reference instructions. Usually, the
average memory latency also determines the maximum instruction processing speed. When latency
cannot be hidden via overlapped operations, a tangible performance penalty is incurred. We define the cost
associated with latency as the total induced processor idle time attributable to the latency.

Figure 4: Structural Model of a Multiprocessor

2.2.  Synchronization:  The  Second  Fundamental  Issue 

We will call the basic units of computation into which programs are decomposed for parallel execution
computational tasks or simply tasks. A general model of parallel programming must assume that tasks
are created dynamically during a computation and die after having produced and consumed data.
Situations in parallel programming which require task synchronization include the following basic
operations:

1. Producer-Consumer: A task produces a data structure that is read by another task. If
producer and consumer tasks are executed in parallel, synchronization is needed to avoid the
read-before-write race.

2. Forks and Joins: The join operation forces a synchronization event indicating that two tasks
which had been started earlier by some forking operation have in fact completed.

3. Mutual Exclusion: Non-deterministic events which must be processed one at a time, e.g.,
serialization in the use of a resource.
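
The three situations can be made concrete with a short sketch using Python's standard `threading` module. This is only an illustration on a modern thread library, not a model of any machine discussed here; the function `run_demo` and its parameters are hypothetical.

```python
import threading

def run_demo(workers=4, increments=1000):
    data = []                          # structure written by the producer
    data_ready = threading.Event()     # producer-consumer signal
    total = {"count": 0}               # shared resource
    lock = threading.Lock()            # mutual exclusion
    seen = []

    def producer():
        data.extend([1, 2, 3])
        data_ready.set()               # announce that the write is complete

    def consumer():
        data_ready.wait()              # avoids the read-before-write race
        seen.append(sum(data))

    def worker():
        for _ in range(increments):
            with lock:                 # serialize use of the shared counter
                total["count"] += 1

    tasks = [threading.Thread(target=consumer), threading.Thread(target=producer)]
    tasks += [threading.Thread(target=worker) for _ in range(workers)]
    for t in tasks:                    # fork
        t.start()
    for t in tasks:                    # join: all forked tasks have completed
        t.join()
    return seen[0], total["count"]

print(run_demo())  # (6, 4000)
```

The consumer is deliberately started before the producer: without the event it could read an empty structure, which is exactly the race the first item describes.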

The minimal support for synchronization can be provided by including instructions, such as atomic
TEST-AND-SET, that operate on variables shared by synchronizing tasks¹. However, to clarify the true cost
of such instructions, we will use the Operational Model presented in Figure 5. Tasks in the operational
model have resources, such as registers and memory, associated with them and constitute the smallest unit
of independently schedulable work on the machine. A task is in one of three states: ready-to-execute,
executing or suspended. Tasks ready for execution may be queued locally or globally. When selected, a
task occupies a processor until either it completes or is suspended waiting for a synchronization signal. A
task changes from suspended to ready-to-execute when another task causes the relevant synchronization
event. Generally, a suspended task must be set aside to avoid deadlocks². The cost associated with such
a synchronization is the fixed time to execute the synchronization instruction plus the time taken to switch
to another task. The cost of task switching can be high because it usually involves saving the processor
state, that is, the context associated with the task.

Figure 5: Operational Model of a Multiprocessor

¹ While not strictly necessary, atomic operations such as TEST-AND-SET are certainly a convenient base upon which to build
synchronization operations. See Section


² Consider the case of a single processor system which must execute n cooperating tasks.
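
The task states of the operational model can be sketched with Python generators standing in for tasks on one simulated processor. This is a toy under stated assumptions: the scheduler `run_tasks`, the `("wait", name)` / `("signal", name)` protocol, and the switch counter are all hypothetical, and the counter stands in for the task-switching cost without modeling saved processor state.

```python
from collections import deque

def run_tasks(tasks):
    """Run generator-based tasks on one simulated processor.

    A task yields ("wait", name) to suspend until event `name` has been
    signalled, or ("signal", name) to cause that event.
    Returns (completion order, number of times a task was dispatched).
    """
    ready = deque(tasks.items())      # ready-to-execute queue
    suspended = {}                    # event name -> tasks waiting on it
    signalled = set()
    finished, switches = [], 0
    while ready:
        name, task = ready.popleft()  # the task is now executing
        switches += 1
        for kind, event in task:
            if kind == "signal":
                signalled.add(event)
                ready.extend(suspended.pop(event, []))  # suspended -> ready
            elif kind == "wait" and event not in signalled:
                suspended.setdefault(event, []).append((name, task))
                break                 # set the task aside; free the processor
        else:
            finished.append(name)     # generator exhausted: task completed
    return finished, switches

def consumer():
    yield ("wait", "data")

def producer():
    yield ("signal", "data")

print(run_tasks({"consumer": consumer(), "producer": producer()}))
# (['producer', 'consumer'], 3)
```

Dispatching the consumer first costs an extra switch, because it must be set aside and resumed later; if it simply held the processor while waiting, the producer could never run, which is the single-processor deadlock the footnote alludes to.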
There are several subtle issues in accounting for synchronization costs. An event to enable or dispatch a
task needs a name, such as that of a register or a memory location, and thus, synchronization cost should
also include the instructions that generate, match and reuse identifiers which name synchronization
events. It may not be easy to identify the instructions executed for this purpose. Nevertheless, such
instructions represent overhead because they would not be present if the program were written to execute
on a single sequential processor. The hardware design usually dictates the number of names available for
synchronization as well as the cost of their use.

The other subtle issue has to do with the accounting for intra-task synchronization. As we shall see in
Section 3, most high performance computers overlap the execution of instructions belonging to one task.
The techniques used for synchronization of instructions in such a situation (e.g., instruction dispatch and
suspension) are often quite different from techniques for inter-task synchronization. It is usually safe and
cheaper not to put aside the instruction waiting for a synchronization event, but rather to idle (or,
equivalently, to execute NO-OP instructions while waiting). This is usually done under the assumption
that the idle time will be on the order of a few instruction cycles. We define the synchronization cost in
such situations to be the induced processor idle time attributable to waiting for the synchronization event.


3. Processor Architectures to Tolerate Latency

In this section, we describe those changes in von Neumann architectures that have directly reduced the
effect of memory latency on performance. Increasing the processor state and instruction pipelining are
the two most effective techniques for reducing the latency cost. Using the Cray-1 (perhaps the best pipelined
machine design to date), we will illustrate that it is difficult to keep more than 4 or 5 instructions in the
pipeline of a von Neumann processor. It will be shown that every change in the processor architecture
which has permitted overlapped execution of instructions has necessitated introduction of a cheap
synchronization mechanism. Often these synchronization mechanisms are hidden from the user and not
used for inter-task synchronization. This discussion will further illustrate that reducing latency frequently
increases synchronization costs.

Before describing these evolutionary changes to hide latency, we should point out that the memory
system in a multiprocessor setting creates more problems than just increased latency. Let us assume that
all memory modules in a multiprocessor form one global address space and that any processor can read
any word in the global address space. This immediately brings up the following problems:

• The time to fetch an operand may not be constant because some memories may be "closer"
than others in the physical organization of the machine.

• No useful bound on the worst case time to fetch an operand may be possible at machine
design time because of the scalability assumption. This is at odds with RISC designs which
treat memory access time as bounded and fixed.

• If a processor were to issue several (pipelined) memory requests to different remote memory
modules, the responses could arrive out of order.

All of these issues are discussed and illustrated in the following sections. A general solution for
accepting memory responses out of order requires a synchronization mechanism to match responses with
the destination registers (names in the task's context) and the instructions waiting on that value. The
Denelcor HEP [25] is one of the very few architectures which has provided such mechanisms in
the von Neumann framework. However, the architecture of the HEP is sufficiently different from von
Neumann architectures as to warrant a separate discussion (see Section 5).
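
A minimal sketch of such a matching mechanism follows, in Python. It is not a description of the HEP or of any real hardware: it simply assumes that every outstanding request carries a tag, and that a table maps each tag to its destination register. All names (`issue_loads`, `accept_responses`, the register names and addresses) are hypothetical.

```python
def issue_loads(requests):
    """requests: list of (destination register, address).
    Returns the pending table: tag -> destination register."""
    return {tag: reg for tag, (reg, _addr) in enumerate(requests)}

def accept_responses(pending, responses, registers):
    """responses: (tag, value) pairs in arrival order, which may differ
    from issue order. Each response is matched to its register by tag."""
    for tag, value in responses:
        reg = pending.pop(tag)      # match the response to its destination
        registers[reg] = value      # instructions waiting on reg may proceed
    return registers

pending = issue_loads([("r1", 0x100), ("r2", 0x200), ("r3", 0x300)])
arrivals = [(2, 30), (0, 10), (1, 20)]          # out-of-order arrival
print(accept_responses(pending, arrivals, {}))  # {'r3': 30, 'r1': 10, 'r2': 20}
```

The tag plays the role of the "name" of the synchronization event: without it, a processor that issues several requests has no way to tell which response belongs to which register.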
Figure 6 depicts the modern-day view of the von Neumann computer [20]. In the earliest
computers, such as EDSAC, the processor state consisted solely of an accumulator, a quotient register,
and a program counter. Memories were relatively slow compared to the processors, and thus, the time to
fetch an instruction and its operands completely dominated the instruction cycle time. Speeding up the
Arithmetic Logic Unit was of little use unless the memory access time could also be reduced.

The appearance of multiple "accumulators" reduced the number of operand fetches and stores, and
index registers dramatically reduced the number of instructions executed by essentially eliminating the
need for self-modifying code. Since the memory traffic was drastically lower, programs executed much
faster than before. However, the enlarged processor state did not reduce the time lost during memory
references and, consequently, did not contribute to an overall reduction in cycle time; the basic cycle time
improved only with improvements in circuit speeds.

3.2. Instruction Prefetching

The time taken by instruction fetch (and perhaps part of instruction decoding time) can be totally hidden
if prefetching is done during the execution phase of the previous instruction. If instructions and data are
kept in separate memories, it is possible to overlap instruction prefetching and operand fetching also.
(The IBM STRETCH [7] and Univac LARC [16] represent two of the earliest attempts at implementing
this idea.) Prefetching can reduce the cycle time of the machine by twenty to thirty percent, depending
upon the amount of time taken by the first two steps of the instruction cycle with respect to the complete
cycle. However, the effective throughput of the machine cannot increase proportionately because
overlapped execution is not possible with all instructions.

Instruction prefetching works well when the execution of instruction n does not have any effect on
either the choice of instructions to fetch (as is the case in a BRANCH) or the content of the fetched
instruction (self-modifying code) for instructions n+1, n+2, ..., n+k. The latter case is usually handled by
simply outlawing it. However, effective overlapped execution in the presence of BRANCH instructions has
remained a problem. Techniques such as prefetching both branch targets have shown little
performance/cost benefit. Lately, the concept of delayed BRANCH instructions from microprogramming
has been incorporated, with success, in LOAD/STORE architectures (see Section 3.4). The idea is to delay
the effect of a branch by one instruction. Thus, the instruction at n+1 following a branch instruction at
n is always executed regardless of which way the BRANCH at n goes. One can always follow a BRANCH
instruction with a NO-OP instruction to get the old effect. However, experience has shown that seventy
percent of the time a useful instruction can be put in that position.
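
The semantics of a delayed branch can be illustrated with a toy interpreter in Python. The instruction encoding below (a label plus an optional branch target, with every branch taken) and the function `run_program` are assumptions made for this sketch and do not correspond to any particular instruction set.

```python
def run_program(program, delayed):
    """Execute a tiny program and return the labels of executed instructions.

    Each instruction is (label, target): target is None for an ordinary
    instruction, or the index to jump to for a (taken) BRANCH.
    """
    trace, pc, pending = [], 0, None
    while pc < len(program):
        label, target = program[pc]
        trace.append(label)
        next_pc = pc + 1
        if pending is not None:       # a branch issued one instruction ago
            next_pc, pending = pending, None
        if target is not None:
            if delayed:
                pending = target      # takes effect after the next instruction
            else:
                next_pc = target      # conventional branch: jump immediately
        pc = next_pc
    return trace

program = [("BRANCH", 3), ("ADD", None), ("SUB", None), ("STORE", None)]
print(run_program(program, delayed=False))  # ['BRANCH', 'STORE']
print(run_program(program, delayed=True))   # ['BRANCH', 'ADD', 'STORE']
```

With the delayed branch, the ADD in the slot after the BRANCH is executed before control transfers; replacing it with a NO-OP gives the conventional behavior at the cost of one wasted slot.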

3.3. Instruction Buffers, Operand Caches and Pipelined Execution

The time to fetch instructions can be further reduced by providing a fast instruction buffer. In machines
such as the CDC 6600 (-U)i and the Cray-1 [37], the instruction buffer is automatically loaded with n
instructions in the neighborhood of the referenced instruction (relying on spatial locality in code
references), whenever the referenced instruction is found to be missing. To take advantage of instruction
buffers, it is also necessary to speed up the operand fetch and execute phases. This is usually done by
providing operand caches or buffers, and overlapping the operand fetch and execution phases³. Of
course, balancing the pipeline under these conditions may require further pipelining of the ALU. If
successful, these techniques can reduce the machine cycle time to one-fourth or one-fifth the cycle time of
an unpipelined machine. However, overlapped execution of four to five instructions in the von Neumann
framework presents some serious conceptual difficulties, as discussed next.

Figure 6: The von Neumann Processor (From Gajski and Peir [20])

Designing a well-balanced pipeline requires that the time taken by various pipeline stages be more or
less equal, and that the "things", i.e., instructions, entering the pipe be independent of each other.
Obviously, instructions of a program cannot be totally independent except in some special trivial cases.
Instructions in a pipe are usually related in one of two ways: instruction n produces data needed by
instruction n+k, or only the complete execution of instruction n determines the next instruction to be
executed (the aforementioned branch problem).

Limitations on hardware resources can also cause instructions to interfere with one another. Consider
the case when both instructions n and n+1 require an adder, but there is only one of these in the machine.
Obviously, one of the instructions must be deferred until the other is complete. A pipelined machine must
be temporarily able to prevent a new instruction from entering the pipeline when there is a possibility of
interference with the instructions already in the pipe. Detecting and quickly resolving these hazards is
very difficult with ordinary instruction sets, e.g., IBM 370, VAX 11 or Motorola 68000, due to their
complexity.

A major complication in pipelining complex instructions is the variable amount of time taken in each
stage of instruction processing (refer to Figure 7). Operand fetch in the VAX is one such example:
determining the addressing mode for each operand requires a fair amount of decoding, and actual fetching
can involve 0 to 2 memory references per operand. Considering all possible addressing mode
combinations, an instruction may involve 0 to 6 memory references in addition to the instruction fetch
itself. A pipeline design that can effectively tolerate such variations is close to impossible.

3.4. Load/Store Architectures

Seymour Cray, in the sixties, pioneered instruction sets (CDC 6600, Cray-1) which separate instructions
into two disjoint classes. In one class are instructions which move data unchanged between memory and
high speed registers. In the other class are instructions which operate on data in the registers. Instructions
of the second class cannot access the memory. This rigid distinction simplifies instruction scheduling.
For each instruction, it is trivial to see if a memory reference will be necessary or not. Moreover, the
memory system and the ALU may be viewed as parallel, noninteracting pipelines. An instruction
dispatches exactly one unit of work to either one pipe or the other, but never both.

Such architectures have come to be known as LOAD/STORE architectures and include the machines built
by Reduced Instruction Set Computer (RISC) enthusiasts (the IBM 801, Berkeley's RISC, and
Stanford MIPS [22] are prime examples). LOAD/STORE architectures use the time between instruction
decoding and instruction dispatching for hazard detection and resolution (see Figure 8). The design of the
instruction pipeline is based on the principle that if an instruction gets past some fixed pipe stage, it
should be able to run to completion without incurring any previously unanticipated hazards.

LOAD/STORE architectures are much better at tolerating latencies in memory accesses than other von
Neumann architectures. In order to explain this point, we will first discuss a simplified model which
detects and avoids hazards in a LOAD/STORE architecture similar to the Cray-1. Assume there is a bit
associated with every register to indicate that the contents of the register are undergoing a change. The bit
corresponding to register R is set the moment we dispatch an instruction that wants to update
R. Following this, instructions are allowed to enter the pipeline only if they don't need to reference or
modify register R or other registers reserved in a similar way. Whenever a value is stored in R, the
reservation on R is removed, and if an instruction is waiting on R, it is allowed to proceed. This simple
scheme works only if we assume that registers whose values are needed by an instruction are read before
the next instruction is dispatched, and that the ALU or the multiple functional units within the ALU are
pipelined to accept inputs as fast as the decode stage can supply them^. The dispatching of an instruction
can also be held up because it may require a bus for storing results in a clock cycle when the bus is
needed by another instruction in the pipeline. Whenever BRANCH instructions are encountered, the
pipeline is effectively held up until the branch target has been decided.
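
As an illustration, the reservation-bit scheme described above can be sketched in a few lines of Python.
This is a minimal sketch under stated assumptions rather than a description of the Cray-1 hardware: the
class name ReservationBits and its methods are hypothetical, and bus conflicts and branches are not
modeled.

```python
class ReservationBits:
    """One busy bit per register (hypothetical model of the scheme above)."""

    def __init__(self, num_registers):
        self.busy = [False] * num_registers

    def can_dispatch(self, sources, dest):
        # An instruction may enter the pipeline only if none of the
        # registers it reads or writes is currently reserved.
        return not any(self.busy[r] for r in (*sources, dest))

    def dispatch(self, sources, dest):
        # Returns False when the instruction must be held at the decode stage.
        if not self.can_dispatch(sources, dest):
            return False
        self.busy[dest] = True   # reserve the destination register
        return True

    def write_back(self, dest):
        self.busy[dest] = False  # value stored: the reservation is removed
```

An instruction that reads or writes a reserved register is simply refused until write_back clears the
bit, which is the only synchronization this model provides.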

Notice what will happen when an instruction to load the contents of some memory location M into some
register R is executed. Suppose that it takes k cycles to fetch something from the memory. It will be
possible to execute several instructions during these k cycles as long as none of them refer to register
R. In fact, this situation is hardly different from the one in which R is to be loaded from some functional
unit that, like the Floating Point multiplier, takes several cycles to produce the result. These gaps in the
pipeline can be further reduced if the compiler reorders instructions such that instructions consuming a
datum are put as far as possible from instructions producing that datum. Thus we notice that machines
designed for high pipelining of instructions can hide large memory latencies provided there is local
parallelism amongst instructions^.

^Indeed, in the Cray-1, functional units can accept an input every clock cycle and registers are always
read in one clock cycle after an instruction is dispatched from the Decoder.

Figure 8: Hazard Avoidance at the Instruction Decode Stage
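
The benefit of compiler reordering can be checked with a small simulation. The sketch below assumes
in-order dispatch of at most one instruction per cycle, single-cycle ALU operations, and a LOAD whose
register becomes usable load_latency cycles after dispatch. The function run, the register names and
the two sample programs are illustrative choices, not taken from any real machine.

```python
def run(program, load_latency):
    """Return the number of cycles needed to dispatch the whole program.

    program: list of (op, dest, sources); op is "LOAD" or "ALU".
    An instruction waits while any register it reads or writes is reserved.
    """
    ready_at = {}  # register -> first cycle in which its new value is usable
    cycle = 0
    for op, dest, sources in program:
        # Stall until every register touched by the instruction is unreserved.
        needed = [ready_at.get(r, 0) for r in (*sources, dest)]
        cycle = max([cycle] + needed)
        latency = load_latency if op == "LOAD" else 1
        ready_at[dest] = cycle + latency
        cycle += 1  # the next instruction can be dispatched one cycle later
    return cycle


# Each consumer immediately follows its producer.
NAIVE = [
    ("LOAD", "R1", ()), ("ALU", "R2", ("R1",)),
    ("LOAD", "R3", ()), ("ALU", "R4", ("R3",)),
]
# Both loads are issued first, so their latencies overlap.
REORDERED = [
    ("LOAD", "R1", ()), ("LOAD", "R3", ()),
    ("ALU", "R2", ("R1",)), ("ALU", "R4", ("R3",)),
]
```

With a four-cycle memory, run(NAIVE, 4) is 10 while run(REORDERED, 4) is 6. With a one-cycle memory
both orders take 4 cycles, so the reordering only matters when there is latency to hide.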

From another point of view, latency cost has been reduced by introducing a cheap synchronization
mechanism: reservation bits on processor registers. However, the number of names available for
synchronization, i.e., the size of the task's processor-bound context, is precisely the number of registers,
and this restricts the amount of exploitable parallelism and tolerable latency. In order to understand this
issue better, consider the case when the compiler decides to use register R to hold two different values at
two different instructions. This will require the two instructions to be executed sequentially, while no
such order may have been implied by the source code. Shadow registers have been suggested to deal
with this class of problems. In fact, shadow registers are an engineering approach to solving a
non-engineering problem. The real issue is naming. The reason that the addition of explicit and implicit
registers improves the situation derives from the addition of (explicit or implicit) names for
synchronization and, hence, a greater opportunity for tolerating latency.

^[...] two instructions usually means that these instructions [...]

Some LOAD/STORE architectures have eliminated the need for reservation bits on registers by making the
compiler responsible for scheduling instructions, such that the result is guaranteed to be available. The
compiler can perform hazard resolution only if the time for each operation is known; it
inserts NO-OP instructions wherever necessary. Because the instruction execution times are an intrinsic
part of the object code, any change to the machine's structure will
require changes to the compiler and regeneration of the code. This is obviously contrary to the notion of
generality, and hinders the portability of software from one generation of machine to the next.
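
Compile-time hazard resolution of this kind can be illustrated with a short sketch. The latency table,
the function insert_noops and the instruction tuples are hypothetical; the sketch assumes one
instruction issues per cycle and that only read-after-write hazards need padding.

```python
LATENCY = {"LOAD": 3, "ADD": 1, "MUL": 2}  # hypothetical cycle counts


def insert_noops(program, latency=LATENCY):
    """Pad a program with NO-OPs so that no instruction is issued before
    its source registers have been written.

    program: list of (op, dest, sources).
    """
    ready_at = {}    # register -> cycle in which its value becomes available
    scheduled = []
    for op, dest, sources in program:
        cycle = len(scheduled)  # issue slot of the next instruction
        earliest = max([cycle] + [ready_at.get(r, 0) for r in sources])
        scheduled.extend([("NO-OP", None, ())] * (earliest - cycle))
        scheduled.append((op, dest, sources))
        ready_at[dest] = earliest + latency[op]
    return scheduled
```

Because the padding depends on the latency table, the same source program yields different object
code on a machine with a slower memory, which is the portability problem noted above.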

Current LOAD/STORE architectures assume that memory references either take a fixed amount of time
(one cycle in most RISC machines) or that they take a variable but predictable amount of time (as in the
Cray-1). In RISC machines, this time is derived on the basis of a cache hit. If the operand is found to be
missing from the cache, the pipeline stops. Equivalently, one can think of this as a situation where a
clock cycle is stretched to the time required. This solution works because, in most of these machines,
there can be either one or a very small number of memory references in progress at any given time. For
example, in the Cray-1, no more than four independent addresses can be generated during a memory
cycle. If the generated address causes a bank conflict, the pipeline is stopped; however, the conflict is
resolved in at most three cycles.

LOAD/STORE architectures, because of their simpler instructions, often execute more instructions than
machines with more complex instructions. This increase may be regarded as synchronization cost.
However, this is easily compensated by improvements in clock speed made possible by simpler control
mechanisms.

4. Synchronization Methods for Multiprocessing

4.1. Global Scheduling on Synchronous Machines

For a totally synchronous multiprocessor it is possible to envision a master plan which specifies
operations for every cycle on every processor. An analogy can be made between programming such a
multiprocessor and coding a horizontally microprogrammed machine. Recent advances in compiling
[18] have made such code generation feasible and encouraged researchers to propose and build several
different synchronous multiprocessors. Cydrome and Multiflow computers, which grew out of these
proposals, are examples of such machines. These machines are generally referred to as
very long instruction word, or VLIW, machines, because each instruction actually contains multiple
smaller instructions (one per functional unit or processing element). The strategy is based on maximizing
the use of resources and resolving potential run-time conflicts in the use of resources at compile time.
Memory references and control transfers are "anticipated" as in RISC architectures, but here, multiple
concurrent threads of computation are being scheduled instead of only one. Given the possibility of
decoding and initiating many instructions in parallel, such architectures are highly appealing when one
realizes that the fastest machines available now still essentially decode and dispatch instructions one at a
time.

This technique has been reported to be effective in scheduling computations on a small number (4 to 8)
of processors. Exploiting parallelism beyond this level, however, becomes intractable. It is unclear how
problems which rely on dynamic storage allocation or require nondeterministic and real-time constraints
will play out on such architectures.

4.2. Interrupts and Low-level Context Switching

Almost all von Neumann machines are capable of accepting and handling interrupts. Not surprisingly,
multiprocessors based on such machines permit the use of inter-processor interrupts as a means for
signalling events. However, interrupts are rather expensive because, in general, the processor state needs
to be saved. The state-saving may be forced by the hardware as a direct consequence of allowing the
interrupt to occur, or it may occur explicitly, i.e., under the control of the programmer, via a single very
complex instruction or a suite of less complex ones. Independent of how the state saving happens, the
important thing to note is that each interrupt will generate a significant amount of traffic across the
processor-memory interface.

In the previous discussion, we concluded that larger processor state is good because it provides a means
for reducing memory latency cost. In trying to solve the problem of low cost synchronization, we have
now come across an interaction which, we believe, is more than just coincidental. Specifically, in very
fast von Neumann processors, the "obvious" synchronization mechanism (interrupts) will only work well
in the trivial case of infrequent synchronization events or when the amount of processor state which must
be saved is very small. Said another way, reducing the cost of synchronization by making interrupts
cheap would generally entail increasing the cost of memory latency.

Uniprocessors such as the Xerox Alto, the Xerox Dorado and the Symbolics family
have used a technique which may be called microcode-level context switching to allow sharing of
the CPU resource by the I/O device adapters. This is accomplished by duplicating programmer-visible
registers, in other words, the processor state. Thus, in one microinstruction the processor can be switched
to a new task without causing any memory references to save the processor state^. This dramatically
reduces the cost of processing certain types of events that cause frequent interrupts. As far as we know,
nobody has adapted the idea of keeping multiple contexts in a multiprocessor setting (with the possible
exception of the HEP, to be discussed in Section 5) although it should reduce synchronization cost over
processors which can hold only a single context. It may be worth thinking about adopting this scheme to
reduce the latency cost of a nonlocal memory reference as well.

The limitations of this approach are obvious. High performance processors may have a small
programmer-visible state (number of registers) but a much larger implicit state (caches). Low level task
switching does not necessarily take care of the overhead of flushing caches. Further, one can only have
a small number of independent contexts without completely overshadowing the cost of ALU hardware.


4.3. Semaphores and the Ultracomputer

Next to interrupts, the most commonly supported feature for synchronization is an atomic operation to
test and set the value of a memory location. A processor can signal another processor by writing into a
location which the other processor keeps reading to sense a change. Even though, theoretically, it is
possible to perform such synchronization with ordinary read and write memory operations, the task is
much simpler with an atomic TEST-AND-SET instruction. TEST-AND-SET is powerful enough to implement
all types of synchronization paradigms mentioned earlier. However, the synchronization cost of using
such an instruction can be very high. Essentially, the processor that executes it goes into a busy-wait
cycle. Not only does the processor get blocked, it generates extra memory references at every instruction
cycle until the TEST-AND-SET instruction is executed successfully. Implementations of TEST-AND-SET that
permit non-busy waiting imply context switching in the processor and thus are not necessarily cheap
either.
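
The memory traffic caused by busy-waiting can be made concrete with a small deterministic simulation.
The sketch below is only an illustration: the Memory class, the function simulate and the round-robin
cycle model (every waiting processor probes the lock once per cycle) are assumptions made for the
example, not a description of any particular machine.

```python
class Memory:
    """A memory that counts every reference made to it."""

    def __init__(self):
        self.cells = {}
        self.references = 0

    def test_and_set(self, addr):
        """Atomically return the old value of addr and set it to 1."""
        self.references += 1
        old = self.cells.get(addr, 0)
        self.cells[addr] = 1
        return old

    def store(self, addr, value):
        self.references += 1
        self.cells[addr] = value


def simulate(num_processors, critical_section_cycles):
    """Total memory references made while every processor acquires the
    lock once, holds it for critical_section_cycles, and releases it."""
    assert critical_section_cycles >= 1
    mem, lock = Memory(), "lock"
    waiting = list(range(num_processors))
    holder, remaining = None, 0
    while waiting or holder is not None:
        # Every waiting processor probes the lock once per cycle.
        for p in list(waiting):
            if mem.test_and_set(lock) == 0:  # lock was free: acquired
                waiting.remove(p)
                holder, remaining = p, critical_section_cycles
        # The holder spends one cycle inside its critical section.
        if holder is not None:
            remaining -= 1
            if remaining == 0:
                mem.store(lock, 0)           # release the lock
                holder = None
    return mem.references
```

Each processor needs only two useful references (one successful TEST-AND-SET and one releasing
STORE). In this model four processors with a three-cycle critical section generate 26 references, so
18 of them are failed probes.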

It is possible to improve upon the TEST-AND-SET instruction in a multiprocessor setting, as suggested by
the NYU Ultracomputer group [17]. Their technique can be illustrated by the atomic FETCH-AND-<OP>
instruction (an evolution of the REPLACE-ADD instruction). The instruction requires an address and a
value, and works as follows: suppose two processors, i and j, simultaneously execute FETCH-AND-ADD
instructions with arguments (A, v_i) and (A, v_j) respectively. After one instruction cycle, the contents
of A will become (A) + v_i + v_j. Processors i and j will receive, respectively, either (A) and
(A) + v_i, or (A) + v_j and (A) as results. Indeterminacy is a direct consequence of the race to update
memory cell A.
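
The two possible outcomes can be enumerated mechanically. The following sketch models FETCH-AND-ADD
as an indivisible read-modify-write on a dictionary and tries every serial order in which simultaneous
requests could be applied. The function names and the sample values are illustrative.

```python
from itertools import permutations


def fetch_and_add(memory, addr, value):
    """Return the old contents of addr and add value to it (one atomic step)."""
    old = memory[addr]
    memory[addr] = old + value
    return old


def possible_outcomes(initial, increments):
    """Set of (results per processor, final contents of A) over all the
    serial orders in which the simultaneous requests may be applied."""
    outcomes = set()
    for order in permutations(range(len(increments))):
        memory = {"A": initial}
        results = [None] * len(increments)
        for p in order:
            results[p] = fetch_and_add(memory, "A", increments[p])
        outcomes.add((tuple(results), memory["A"]))
    return outcomes
```

For (A) = 10, v_i = 3 and v_j = 5, the processors receive either (10, 13) or (15, 10), and the cell
always ends up holding 18.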

An architect must choose between a wide variety of implementations for FETCH-AND-<OP>. One
possibility is that the processor may interpret the instruction with a series of more primitive instructions.
While possible, such a solution does not find much favor because it will cause considerable memory
traffic. A second scheme implements FETCH-AND-<OP> in the memory controller (this is the alternative
chosen by the CEDAR project). This typically results in a significant reduction of network traffic
because atomicity of memory transactions from the memory's controller happens by default. The scheme
suggested by the NYU Ultracomputer group implements the instruction in the switching nodes of the
network.

This implementation calls for a combining packet communication network which connects n processors
to an n-port memory. If two packets collide, say FETCH-AND-ADD(A, v_i) and FETCH-AND-ADD(A, v_j), the
switch extracts the values v_i and v_j, forms a new packet (FETCH-AND-ADD(A, v_i + v_j)), forwards it to the
memory, and stores the value of v_i temporarily. When the memory returns the old value of location A,
the switch returns two values ((A) and (A) + v_i). The main improvement is that some synchronization
situations which would have taken O(n) time can be done in O(log n) time. It should be noted, however,
that one memory reference may involve as many as log2 n additions, and implies substantial hardware
complexity. Further, the issue of processor idle time due to latency has not been addressed at all. In the
worst case, the complexity of hardware may actually increase the latency of going through the switch and
thus completely overshadow the advantage of "combining" over other simpler implementations.
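
The combining step can be sketched recursively. The code below is a simplified model, not the
Ultracomputer switch design: it assumes that all requests target the same address, that packets always
collide pairwise at each level of a binary tree, and that the function name combine is our own.

```python
def combine(requests, memory, addr):
    """Serve FETCH-AND-ADD requests to one address through a binary tree
    of combining switches.

    requests: list of increments, one per processor.
    Returns (old value seen by each processor, number of switch levels).
    """
    if len(requests) == 1:
        old = memory[addr]               # the single combined packet reaches memory
        memory[addr] = old + requests[0]
        return [old], 0
    # Each switch merges two colliding packets into one carrying the sum
    # and remembers the first increment so that it can split the reply.
    pairs = [requests[i:i + 2] for i in range(0, len(requests), 2)]
    merged = [sum(pair) for pair in pairs]
    replies, levels = combine(merged, memory, addr)
    results = []
    for pair, old in zip(pairs, replies):
        results.append(old)                  # first requester receives (A)
        if len(pair) == 2:
            results.append(old + pair[0])    # second receives (A) + v_i
    return results, levels + 1
```

Four requests with increments 1, 2, 3 and 4 against a cell holding 100 produce a single memory access,
pass through two switch levels, and return 100, 101, 103 and 106, which is what some serial order of the
four instructions would have produced.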

The simulation results reported by NYU [17] show quasi-linear speedup on the Ultracomputer (a shared
memory machine with ordinary von Neumann processors, employing FETCH-AND-ADD synchronization)
for a large variety of scientific applications. We are not sure how to interpret these results without
knowing many more details of their simulation model. Two possible interpretations are the following:

1. Parallel branches of a computation hardly share any data. Thus, the costly mutual exclusion
synchronization is rarely needed in real applications.

2. The synchronization cost of using shared data can be acceptably brought down by judicious
use of cachable/non-cachable annotations in the source program.

The second point may become clearer after reading the next section.


4.4. Cache Coherence Mechanisms

While highly successful for reducing memory latency in uniprocessors, caches in a multiprocessor
setting introduce a serious synchronization problem called cache coherence. Censier and Feautrier
[10] define the problem as follows: "A memory scheme is coherent if the value returned on a LOAD
instruction is always the value given by the latest STORE instruction with the same address". It is easy to
see that this may be difficult to achieve in multiprocessing.




Suppose we have a tightly coupled two-processor system. Each processor has its own cache to which it
has exclusive access. Suppose further that two tasks are running, one on each processor, and we know
that the tasks are designed to communicate through one or more
shared memory cells. In the absence of caches, this scheme can be made to work. However, if it happens
that the shared address is present in both caches, the individual processors can read and write the address
and never see any changes caused by the other processor. Using a store-through design instead of a
store-in design does not solve the problem either. What is logically required is a mechanism which, upon
the occurrence of a STORE to location x, invalidates copies of location x in caches of other processors, and
guarantees that subsequent LOADs will get the most recent value. This can incur significant
overhead in terms of decreased memory bandwidth.
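
The stale-read scenario is easy to reproduce in a toy model. The sketch below assumes two private
store-through caches in front of a shared dictionary that stands in for main memory. The Cache class
and its methods are hypothetical and deliberately omit any coherence mechanism.

```python
class Cache:
    """A private store-through cache with no coherence mechanism."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}

    def load(self, addr):
        if addr not in self.lines:           # miss: fetch from main memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def store(self, addr, value):
        self.lines[addr] = value
        self.memory[addr] = value            # store-through: memory is updated too

    def invalidate(self, addr):
        self.lines.pop(addr, None)           # drop the local copy, if any
```

If both caches have loaded x and the first then stores 42, main memory holds 42 but the second cache
still returns its old copy. Only after its copy of x is invalidated does a LOAD observe the new value.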

The solutions to the cache coherence problem center around reducing the cost of detecting rather than
avoiding the possibility of cache incoherence. Generally, state information indicating whether the cached
data is private or shared, read-only or read-write, etc., is associated with each cache entry. However, this
state somehow has to be updated after each memory reference. Implementations of this idea are generally
intractable except possibly in the domain of bus-oriented multiprocessors. The so-called snoopy bus
solution uses the broadcasting capability of buses and purges x from all caches when a processor
attempts a STORE to x. In such a system, at most one STORE operation can go on at a time in the whole
system and, therefore, system performance is going to be a strong function of the snoopy bus' ability to
handle the coherence-maintaining traffic.

It is possible to improve upon the above solution if some additional state information is kept with each
cache entry. Suppose entries are marked "shared" or "non-shared". A processor can freely read shared
entries, but an attempt to STORE into a shared entry immediately causes that address to appear on the
snoopy bus. That entry is then deleted from all the other caches and is marked "non-shared" in the
processor that had attempted the STORE. Similar action takes place when the word to be written is
missing from the cache. Of course, the main memory must be updated before purging the private copy
from any cache. When the word to be read is missing from the cache, the snoopy bus may have to first
reclaim the copy privately held by some other cache before giving it to the requesting cache. The status of
such an entry will be marked as shared in both caches. The advantage of keeping shared/non-shared
information with every cache entry is that the snoopy bus comes into action only on cache misses and
STOREs to shared locations, as opposed to all LOADs and STOREs. Even if these solutions work
satisfactorily, bus-oriented multiprocessors are not of much interest to us because of their obvious
limitations in scaling.
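
A compact model of the shared/non-shared scheme shows when the bus is actually used. The sketch
below is an illustration under stated assumptions: the SnoopyBus class is hypothetical, caches are
store-in (a non-shared entry may be newer than main memory), and an entry loaded while no other
cache holds the word is marked non-shared.

```python
class SnoopyBus:
    """Write-invalidate protocol with a shared/non-shared bit per entry."""

    def __init__(self, memory, num_caches):
        self.memory = memory
        self.caches = [{} for _ in range(num_caches)]  # addr -> [value, shared]
        self.transactions = 0                          # bus operations so far

    def _reclaim(self, addr):
        # Write any privately held copy back to main memory first.
        for cache in self.caches:
            entry = cache.get(addr)
            if entry is not None and not entry[1]:
                self.memory[addr] = entry[0]

    def load(self, cpu, addr):
        cache = self.caches[cpu]
        if addr not in cache:                          # a read miss uses the bus
            self.transactions += 1
            self._reclaim(addr)
            for other in self.caches:
                if addr in other:
                    other[addr][1] = True              # now shared in both caches
            others_have = any(addr in c for c in self.caches)
            cache[addr] = [self.memory[addr], others_have]
        return cache[addr][0]

    def store(self, cpu, addr, value):
        cache = self.caches[cpu]
        entry = cache.get(addr)
        if entry is None or entry[1]:                  # miss, or STORE to shared entry
            self.transactions += 1
            self._reclaim(addr)
            for i, other in enumerate(self.caches):
                if i != cpu:
                    other.pop(addr, None)              # purge all other copies
        cache[addr] = [value, False]                   # entry is now non-shared
```

Repeated LOADs that hit, and STOREs to a non-shared entry, never touch the bus in this model. Only
misses and STOREs to shared entries are counted in transactions.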

As far as we can tell, there are no known solutions to cache coherence for non-bussed machines. It
would seem reasonable that one needs to make caches partially visible to the programmer by allowing
data (actually addresses) to be marked as shared or not shared. In addition, instructions to flush an entry
or a block of entries from a cache have to be provided. Cache management on such machines is possible
only if the concept of shared data is well integrated in the high level language or the programming model.
Schemes have also been proposed explicitly to interlock a location for writing or to bypass the cache (and
flush it if necessary) on a STORE; in either case, the performance goes down rapidly as the machine is
scaled. Ironically, in solving the latency problem via multiple caches, we have introduced the
synchronization problem of keeping caches coherent.

It is worth noting that, while not obvious, a direct trade-off often exists between decreasing the
parallelism and increasing the cachable or non-shared data.

In order to reduce memory latency cost, it is essential that a processor be capable of issuing multiple,
overlapped memory requests. The processor must view the memory/communication system as a
logical pipeline. As latency increases, keeping the pipeline full implies that more memory references will
have to be in the pipeline. We note that memory systems of current von Neumann architectures have very
little capability for pipelining, with the exception of array references in vector machines. The reasons
behind this limitation are fundamental:

1. von Neumann processors must observe instruction sequencing constraints, and

2. The memory references can get out of order in the pipeline and, therefore, a large number of
names to distinguish memory responses must be provided.

One way to overcome the first deficiency is to interleave many threads of sequential computation, as
we saw in the very long instruction word architectures of Section 4.1. The second deficiency can be
overcome by providing a large register set with suitable reservation bits. It should be noted that these
requirements are somewhat in conflict. The situation is further complicated by the need of tasks to
communicate with each other. Support for cheap synchronization calls for the processor to switch tasks
quickly and to have a non-empty queue of tasks which are ready to run. One way to achieve this is again
by interleaving multiple threads of computation and providing some intelligent scheduling mechanism to
avoid busy-waits. Machines supporting multiple threads and fancy scheduling of instructions or
processes look less and less like von Neumann machines as the number of threads increases.

In this section, we first discuss the erstwhile Denelcor HEP, the first
commercially available multi-threaded computer. After that we briefly discuss dataflow machines, which
may be regarded as an extreme example of machines with multiple threads: machines in which each
instruction constitutes an independent thread and only non-suspended threads are scheduled to be
executed.


5.1. The Denelcor HEP: A Step Beyond von Neumann Architectures

The basic structure of the HEP processor is shown in Figure 9. The processor's data path is built as an
eight step pipeline. In parallel with the data path is a control loop which circulates process status words
(PSW's) of the processes whose threads are to be interleaved for execution. The delay around the control
loop varies with the queue size, but is never shorter than eight pipe steps. This minimum value is
intentional to allow the PSW at the head of the queue to initiate an instruction but not return again to the
head of the queue until the instruction has completed. If at least eight PSW's, representing eight
processes, can be kept in the queue, the processor's pipeline will remain full. This scheme is much like
traditional pipelining of instructions, but with an important difference. The inter-instruction dependencies
are likely to be weaker here because adjacent instructions in the pipe are always from different processes.

There are 2048 registers in each processor; each process has an index offset into the register array.
Inter-process, i.e., inter-thread, communication is possible via these registers by overlapping register
allocations. The HEP provides FULL/EMPTY/RESERVED bits on each register and FULL/EMPTY bits on each
word in the data memory. An instruction encountering EMPTY or RESERVED registers behaves like a
NO-OP instruction; the program counter of the process, i.e., PSW, which initiated the instruction is not
incremented. The process effectively busy-waits but without blocking the processor. When a process
issues a LOAD or STORE instruction, it is removed from the control loop and is queued separately in the
Scheduler Function Unit (SFU) which also issues the memory request. Requests which are not satisfied
because of improper FULL/EMPTY status result in recirculation of the PSW within the SFU's loop and also
in reissuance of the request. The SFU matches up memory responses with queued PSW's, updates
registers as necessary and reinserts the PSW's in the control loop.

Figure 9: Latency Toleration

Thus, the HEP is capable up to a point of using parallelism in programs to hide memory and
communication latency. At the same time it provides efficient, low-level synchronization mechanisms in
the form of presence-bits in registers and main memory. However, the HEP approach does not go far
enough because there is a limit of one outstanding memory request per process, and the cost of
synchronization through shared registers can be high because of the loss of processor time due to
busy-waiting. A serious impediment to the software development on HEP was the limit of 64 PSW's in
each processor. Though only 8 PSW's may be required to keep the processor pipeline full, a much larger
number is needed to name all concurrent tasks of a program.
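
The relationship between the number of PSW's in the control loop and pipeline occupancy can be
illustrated with a short simulation. The sketch is a simplification of the scheme described above: it
models only the eight-step control loop, ignores the SFU and memory references, and uses the
hypothetical names PIPE_STEPS and utilization.

```python
from collections import deque

PIPE_STEPS = 8  # depth of the HEP data path, as stated in the text


def utilization(num_processes, cycles=800):
    """Fraction of cycles in which an instruction enters the pipeline.

    A PSW that issues an instruction cannot reach the head of the queue
    again until PIPE_STEPS cycles later, when the instruction is complete.
    """
    queue = deque((0, pid) for pid in range(num_processes))  # (ready cycle, id)
    issued = 0
    for cycle in range(cycles):
        if queue and queue[0][0] <= cycle:
            _, pid = queue.popleft()
            issued += 1
            queue.append((cycle + PIPE_STEPS, pid))          # recirculate the PSW
    return issued / cycles
```

In this model a single process keeps the pipeline one-eighth full, four processes keep it half full, and
eight or more processes keep it completely full.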

5.2. Dataflow Architectures

Dataflow architectures [2, ..., 21, ...] represent a radical alternative to von Neumann architectures
because they use dataflow graphs as their machine language. Dataflow graphs, as opposed to
conventional machine languages, specify only a partial order for the execution of instructions and thus
provide opportunities for parallel and pipelined execution at the level of individual instructions. For
example, the dataflow graph for the expression a*b + c*d only specifies that both multiplications be
executed before the addition; however, the multiplications can be executed in any order or even in
parallel. The advantage of this flexibility becomes apparent when we consider that the order in which a,
b, c and d will become available may not be known at compile time. For example, computations for
operands a and b may take longer than computations for c and d, or vice versa. Another possibility is that
the time to fetch different operands may vary due to scheduling and hardware characteristics of the
machine. Dataflow graphs do not force unnecessary sequentialization, and dataflow processors schedule
instructions according to the availability of the operands.
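
The a*b + c*d example can be made concrete with a small sketch in Python. The graph encoding (GRAPH)
and the function run are hypothetical names chosen for illustration; the sketch assumes each operand
arrives exactly once and simply fires any node whose inputs are all present.

```python
import operator

# Hypothetical encoding of the dataflow graph for a*b + c*d:
# node name -> (operation, names of the two inputs)
GRAPH = {
    "mul1": (operator.mul, ("a", "b")),
    "mul2": (operator.mul, ("c", "d")),
    "add": (operator.add, ("mul1", "mul2")),
}


def run(graph, arrivals):
    """Fire nodes as operands arrive; return (all values, firing order)."""
    values = {}
    fired = []
    for name, value in arrivals:
        values[name] = value
        progress = True
        while progress:  # keep firing until no further node is enabled
            progress = False
            for node, (fn, inputs) in graph.items():
                if node not in values and all(i in values for i in inputs):
                    values[node] = fn(*(values[i] for i in inputs))
                    fired.append(node)
                    progress = True
    return values, fired


# Operands for the first multiplication arrive first ...
values_1, order_1 = run(GRAPH, [("a", 2), ("b", 3), ("c", 4), ("d", 5)])
# ... or operands for the second multiplication arrive first.
values_2, order_2 = run(GRAPH, [("c", 4), ("d", 5), ("a", 2), ("b", 3)])
```

Both arrival orders yield the same value for the addition, but the multiplications fire in a different
order. The graph fixes only the constraint that the addition follows both multiplications.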

The instruction execution mechanism of a dataflow processor is fundamentally different from that of a
von Neumann processor. We will briefly illustrate this using the MIT Tagged-Token architecture (see
Figure 10). Rather than following a Program Counter for the next instruction to be executed and then
fetching operands for that instruction, a dataflow machine provides a low-level synchronization
mechanism in the form of a Waiting-Matching section which dispatches only those instructions for which
data are already available. This mechanism relies on tagging each datum with the address of the
instruction to which it belongs and the context in which the instruction is being executed. One can think
of the instruction address as replacing the program counter, and the context identifier as replacing the
frame base register in a traditional von Neumann architecture. It is the machine's job to match up data
with the same tag and then to execute the denoted instruction. In so doing, new data will be produced,
with a new tag indicating the successor instruction(s). Thus, each [...] operation. Note that the number
of synchronization [...] may be made much larger than the size of the register array in a von Neumann
machine. Note also that the processor pipeline is non-blocking: given that the operands for an
instruction are available, the corresponding instruction can be executed without further synchronization.

Figure 10: The MIT Tagged-Token Dataflow Machine
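
The tag-matching idea can be sketched in Python as follows. This is a simplified model under stated
assumptions, not the MIT machine's design: PROGRAM, execute and the token layout (context, instruction
address, port, value) are hypothetical, every instruction is assumed to take exactly two operands, and
each instruction has at most one successor.

```python
import operator
from collections import deque

# Hypothetical program for a*b + c*d:
# instruction address -> (operation, (successor address, successor port) or None)
PROGRAM = {
    0: (operator.mul, (2, 0)),  # a*b feeds port 0 of instruction 2
    1: (operator.mul, (2, 1)),  # c*d feeds port 1 of instruction 2
    2: (operator.add, None),    # the sum is the final result
}


def execute(program, tokens):
    """Match tokens by tag (context, address) and fire enabled instructions."""
    waiting = {}            # tag -> (port, value) of the operand that arrived first
    results = {}            # context -> final value
    queue = deque(tokens)
    while queue:
        context, addr, port, value = queue.popleft()
        tag = (context, addr)
        if tag not in waiting:
            waiting[tag] = (port, value)   # partner not here yet: wait, do not block
            continue
        _, other_value = waiting.pop(tag)  # partner found: instruction is enabled
        left, right = (value, other_value) if port == 0 else (other_value, value)
        op, dest = program[addr]
        out = op(left, right)
        if dest is None:
            results[context] = out
        else:
            # The result carries a new tag naming the successor instruction.
            queue.append((context, dest[0], dest[1], out))
    return results


# Tokens from two contexts, "x" and "y", arrive interleaved.
tokens = [
    ("x", 0, 0, 2), ("y", 0, 0, 1), ("x", 1, 1, 5), ("y", 0, 1, 1),
    ("x", 0, 1, 3), ("y", 1, 0, 1), ("x", 1, 0, 4), ("y", 1, 1, 1),
]
results = execute(PROGRAM, tokens)
```

A token whose partner has not arrived is set aside in the waiting store, and processing moves on to the
next token. Because the context is part of the tag, tokens from different invocations of the same code
never match each other.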

In addition to the waiting-matching section, which is used primarily for dynamic scheduling of
instructions, the MIT Tagged-Token machine provides a second synchronization mechanism called
I-structure storage. Each word of I-structure storage has bits associated with it to indicate whether the
word is empty, full or has pending read requests. This greatly facilitates overlapped execution of a
producer of a data structure with the consumer of that data structure. There are three instructions at the
graph level to manipulate I-structure storage. These are allocate, which allocates n empty words of
storage; select, which fetches the contents of the i-th word of an array; and store, which stores a value in
a specified word. Generally, software concerns dictate that a word be written into only once before it is
deallocated. The dataflow processor treats all I-structure operations as split-phase. For example, when
the select instruction is executed, a packet containing the tag of the destination instruction of the select
instruction is forwarded to the proper address, possibly in a distant I-structure storage module. The
actual memory operation may require waiting if the data is not present, and thus the result may be
returned many instruction times later. The key is that the instruction pipeline need not be suspended
during this time. Rather, processing of other instructions may continue immediately after initiation of
the operation. Matching of memory responses with waiting instructions is done via tags in the
waiting-matching section.
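
The empty, full and pending-read states of an I-structure word can be sketched in Python. This is a
minimal single-module model under stated assumptions: the class IStructureStorage and its method
signatures are hypothetical, responses are returned as lists of (tag, value) pairs rather than sent as
packets, and a second write to the same word is treated as an error.

```python
EMPTY, FULL, PENDING = "empty", "full", "pending"


class IStructureStorage:
    """Toy I-structure storage with per-word presence state (hypothetical API)."""

    def __init__(self):
        self.state = []
        self.data = []  # the value when FULL, the list of deferred tags when PENDING

    def allocate(self, n):
        """Allocate n empty words and return the address of the first one."""
        base = len(self.state)
        self.state.extend([EMPTY] * n)
        self.data.extend([None] * n)
        return base

    def select(self, addr, tag):
        """Request a read; return the (tag, value) responses available now."""
        if self.state[addr] == FULL:
            return [(tag, self.data[addr])]
        if self.state[addr] == EMPTY:
            self.state[addr] = PENDING
            self.data[addr] = []
        self.data[addr].append(tag)  # defer the read until the word is written
        return []

    def store(self, addr, value):
        """Write a word once; return responses for all deferred reads."""
        if self.state[addr] == FULL:
            raise ValueError("I-structure word written more than once")
        deferred = self.data[addr] if self.state[addr] == PENDING else []
        self.state[addr] = FULL
        self.data[addr] = value
        return [(tag, value) for tag in deferred]


storage = IStructureStorage()
base = storage.allocate(2)
early_1 = storage.select(base, "t1")   # consumer arrives before the producer
early_2 = storage.select(base, "t2")
released = storage.store(base, 7)      # producer writes; deferred reads are answered
late = storage.select(base, "t3")      # later reads are answered immediately
```

The requester never waits inside select. A deferred read is answered when the store arrives, and the
response carries the tag that lets the waiting-matching section pair it with the waiting instruction.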

One advantage of tagging each datum is that data from different contexts can be mixed freely in the
instruction execution pipeline. Thus instruction-level parallelism of dataflow graphs can effectively
absorb the communication latency and minimize the losses due to synchronization waits. We hope it is
clear from the prior discussion that even the most highly pipelined von Neumann processor cannot match
the flexibility of a dataflow processor in this regard. A more complete discussion of dataflow machines is
beyond the scope of this paper. An overview of executing programs on the MIT Tagged-Token Dataflow
machine can be found in [6]. A deeper understanding of dataflow machines can be gotten from [2].
Additional, albeit slightly dated, details of the machine and the instruction set are given in [3] and [5],
respectively.


Conclusions

We have presented the loss of performance due to increased latency and waits for synchronization
events as the two fundamental issues in the design of parallel machines. These issues are, to a large
degree, independent of the technology differences between various parallel machines. Even though we
have not presented it as such, these issues are also independent of the high-level programming model used
on a multiprocessor. If a multiprocessor is built out of conventional microprocessors, then degradation in
performance due to latency and synchronization will show up regardless of whether a shared-memory,
message passing, reduction or dataflow programming model is employed.

Is it possible to modify a von Neumann processor to make it more suitable as a building block for a
parallel machine? In our opinion the answer is a qualified "yes". The two most important characteristics
of a dataflow processor are split-phase memory operations and the ability to put aside computations
(i.e., processes, instructions, or whatever the scheduling quantum is) without blocking the processor. We
think synchronization bits in the storage are essential to support the producer-consumer type of
parallelism. However, the more concurrently active threads of computation we have, the greater is the
requirement for hardware-supported synchronization. [...] and others [...] are actively [...] designs
based on these ideas. [...] processors as [...] processors.




The biggest appeal of von Neumann processors is that they are widely available and familiar. There is a
tendency to extrapolate these facts into a belief that von Neumann processors are "simple" and efficient.
A technically sound case can be made that well designed von Neumann processors are indeed very
efficient in executing sequential codes and require less memory bandwidth than dataflow processors.
However, the efficiency of sequential threads disappears fast if there are too many interruptions or if
idling of the processor due to latency or data-dependent hazards increases. Papadopoulos [31] is
investigating dataflow architectures which will improve the efficiency of the MIT Tagged-Token
architecture on sequential codes without sacrificing any of its dataflow advantages. We can assure the
reader that none of these changes are tantamount to introducing a program counter in the dataflow
architecture.

For lack of space we have not discussed the effect of multi-threaded architectures on the compiling and
language issues. It is important to realize that compiling into primitive dataflow operators is a much
simpler task than compiling into cooperating sequential threads. Since the cost of inter-process
communication in a von Neumann setting is much greater than the cost of communication within a
process, there is a preferred process or "grain" size on a given architecture. Furthermore, placement of
synchronization instructions in a sequential code requires careful planning because an instruction to wait
for a synchronization event may experience very different waiting periods in different locations in the
program. Thus even for a given grain size, it is difficult to decompose a program optimally. Dataflow
graphs, on the other hand, provide a uniform view of inter- and intra-procedural synchronization and
communication and, as noted earlier, only specify a partial order to enforce data dependencies among the
instructions of a program. Though it is very difficult to offer a quantitative measure, we believe that an Id
Nouveau compiler to generate code for a multi-threaded von Neumann computer will be significantly
more complex than the current compiler [41], which generates fine grain dataflow graphs for the MIT
Tagged-Token dataflow machine. Thus dataflow computers, in addition to providing solutions to the
fundamental hardware issues raised in this paper, also have compiler technology to exploit their full
potential.
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